Spelling Error Trends in Urdu
Abstract
Today the most accurate error correction techniques are statistical. For low-resourced languages like Urdu, however, training error corpora are not available, so statistical techniques are not an option. Rule-based techniques that exploit spelling error trends provide a useful alternative, and the study of error patterns in a language is an essential prerequisite for designing such techniques. This paper presents two studies of spelling error trends in Urdu. The results show that, alongside the known spelling error trends common to all languages, Urdu also exhibits some language-specific error patterns. The most important among them are space-related errors and shape-similarity-based errors, which form a dominant portion of the total spelling mistakes in Urdu.

1. Literature Review

Until recently, most spelling correction techniques were designed on the basis of spelling error trends (also called error patterns); many studies were therefore performed to analyze the types and trends of spelling errors. The most notable among these are the studies by Damerau [1] and Peterson [4]. According to these studies, spelling errors are generally divided into two types: typographic errors and cognitive errors.

Typographic errors occur when the correct spelling of the word is known but the word is mistyped. These errors are mostly related to the keyboard and therefore do not follow any linguistic criteria. A study by Damerau [1] shows that 80% of typographic errors fall into one of the following four categories:

1. Single letter insertion, e.g. typing acress for cress
2. Single letter deletion, e.g. typing acress for actress
3. Single letter substitution, e.g. typing acress for across
4. Transposition of two adjacent letters, e.g. typing acress for caress

The errors produced by any one of the above editing operations are also called single-errors [2].
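The four single-error categories can be illustrated with a short sketch. This is a minimal illustration rather than the method of any of the cited studies: the function name `classify_single_error` and its return labels are hypothetical, and it assumes the intended word is already known.

```python
from typing import Optional


def classify_single_error(typed: str, intended: str) -> Optional[str]:
    """Name the single edit that turns `intended` into `typed`.

    Returns "insertion", "deletion", "substitution" or "transposition",
    or None when the words are identical or differ by more than one edit.
    (Hypothetical helper, written only to illustrate the four categories.)
    """
    if typed == intended:
        return None

    if len(typed) == len(intended) + 1:
        # Insertion: dropping one letter from the typed word gives the intended word.
        for i in range(len(typed)):
            if typed[:i] + typed[i + 1:] == intended:
                return "insertion"
        return None

    if len(typed) + 1 == len(intended):
        # Deletion: dropping one letter from the intended word gives the typed word.
        for i in range(len(intended)):
            if intended[:i] + intended[i + 1:] == typed:
                return "deletion"
        return None

    if len(typed) == len(intended):
        # Positions at which the two words disagree.
        diffs = [i for i in range(len(typed)) if typed[i] != intended[i]]
        if len(diffs) == 1:
            return "substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and typed[diffs[0]] == intended[diffs[1]]
            and typed[diffs[1]] == intended[diffs[0]]
        ):
            return "transposition"

    return None


if __name__ == "__main__":
    for intended in ("cress", "actress", "across", "caress"):
        print(intended, "->", classify_single_error("acress", intended))
```

Applied to the examples in the list, the typed word acress is classified differently depending on which word was intended, which is why a corrector cannot recover the intended word from the misspelling alone.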
Damerau’s assertion was later confirmed by a number of researchers, including Peterson [4]. The results of Peterson’s study [4] are shown in Table 1. The data sources were Webster’s Dictionary (Web7) and Government Printing Office (GPO) documents retyped by college students. The rows in Table 1 correspond to the four basic types of errors; the columns correspond to the two sources of data. For each data source, the number and the percentage of each type of error are given. The last row contains the total number and percentage of single-errors.

Table 1. Statistics of the Four Basic Types of Errors (for English).

Error type       GPO            Web7
Transposition    4 (2.6%)       47 (13.1%)
Insertion        29 (18.7%)     73 (20.3%)
Deletion         49 (31.6%)     124 (34.4%)
Substitution     62 (40.0%)     97 (26.9%)
Total            144 (92.9%)    341 (94.7%)

Typographic errors are caused mainly by keyboard adjacencies. The most common of these errors in the GPO data is the substitution error (fourth row of Table 1); in the Web7 data, deletion is more frequent. A substitution error occurs when a letter is replaced by another letter whose key is adjacent on the keyboard to the key of the intended letter. In a study referred to by Kukich [2], 58% of the errors involved adjacent typewriter keys. According to Peterson [4], the next most common errors are two extra letters, two missing letters, and transposition of two letters around a third one. Errors produced by more than one editing operation are called multi-errors [2].

Cognitive errors occur when the correct spelling of the word is not known. In the case of cognitive errors, the pronunciation of the misspelled word is the same as or similar to the pronunciation of the intended word (e.g. receive -> recieve, abyss -> abiss). In a study referred to by Kukich [2], Dutch researchers had 10 subjects transcribe 123 recordings of Dutch surnames; 38% of these transcriptions were incorrect despite being phonetically plausible.
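The idea that a cognitive error keeps the pronunciation of the intended word can be illustrated with a toy check. The sketch below is only an assumption-laden illustration: the function names and the three rewrite rules are invented for the two English examples above, they are not taken from the cited studies, and they are far cruder than established phonetic codes such as Soundex or Metaphone.

```python
import re


def rough_sound_key(word: str) -> str:
    """Reduce a word to a crude pronunciation key.

    The three rules below are illustrative assumptions only; they cover
    the examples in the text and little else.
    """
    key = word.lower()
    key = key.replace("ei", "ie")        # treat "ei" and "ie" as the same sound
    key = key.replace("y", "i")          # treat "y" as the vowel "i"
    key = re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters
    return key


def is_phonetically_plausible(typed: str, intended: str) -> bool:
    """True when a misspelling shares its rough sound key with the intended word."""
    return typed != intended and rough_sound_key(typed) == rough_sound_key(intended)


if __name__ == "__main__":
    print(is_phonetically_plausible("recieve", "receive"))
    print(is_phonetically_plausible("abiss", "abyss"))
    print(is_phonetically_plausible("acress", "actress"))
```

Under these assumed rules the two cognitive examples share a key with their intended words, while the typographic example acress for actress does not.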
Another study referred to by Kukich [2] examined spelling error trends among students of different grades. Considering only those mistakes whose frequency was greater than 5, it found that 64.69% were phonetically correct and another 13.97% were almost phonetically correct. It was postulated that errors with lower frequency tend to be less phonetic.

2. Spelling Error Trends in Urdu

Two studies were performed to identify error patterns in Urdu. Because the two studies differ in the nature of their data and in the methodology used to analyze it, they are discussed separately. Study 1 is also discussed in [3].